1 Introduction

Spotify is one of the larger music streaming services available today, with 345 million active users [1]. Instead of requiring listeners to buy CDs or download every song they want to hear, Spotify gives access to millions of songs without downloading them to a device.

In this project, we ask whether energy, acousticness, loudness, danceability, and liveness follow a clear pattern over the years. We also ask whether any feature is strongly correlated with other features. We hypothesize that some features will show strong trends over time and that some pairs of features will be strongly correlated. In particular, we expect energy and danceability to be strongly correlated, as well as liveness and energy.

2 Background

The data we are using come from Spotify, cover 1921 to 2020, and include over 175,000 audio tracks. We found the data on Kaggle [2]. The dataset groups the data by artist, genre, and year. It measures nine variables: acousticness, danceability, duration, energy, liveness, instrumentalness, loudness, speechiness, and tempo.

For our project, we decided to focus on energy, acousticness, liveness, loudness, and danceability. Energy is a perceptual measure of the intensity and activity of a track on a scale from 0.0 to 1.0. The perceptual features that contribute to it include dynamic range, perceived loudness, timbre, onset rate, and general entropy. Liveness ranges from 0.0 to 1.0 and detects whether an audience is present in a recording. If the liveness value is above 0.8, there is a strong likelihood that the track is live. Acousticness is a confidence measure of whether the track is acoustic. It varies from 0.0 to 1.0, with 1.0 representing high confidence that the track is acoustic. Loudness ranges from -60 to 0 and is measured in decibels (dB). It represents the overall loudness averaged over the entire track. Lastly, danceability combines tempo, rhythm stability, beat strength, and regularity. It rates how suitable a track is for dancing from 0.0 to 1.0, with 1.0 being the most danceable.

In the rest of our report, we first graph each feature by year and add a linear regression line to see whether there are any trends over the years. Then, we test the correlation between pairs of features to see whether they are strongly related. In the end, we hope to discover how different features have changed over the years and how music has evolved.
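The two steps described above can be sketched in base R. The data frame below is a small synthetic stand-in, not the Spotify data, and the column names `year`, `energy`, and `danceability` are assumed to match the by-year file. `lm()` fits the same straight line that `geom_smooth(method = "lm")` would draw on a plot, and `cor.test()` reports Pearson's r for a pair of features.

```r
# Synthetic stand-in for the by-year data (values are made up for illustration)
set.seed(1)
toy <- data.frame(year = 1921:2020)
toy$energy <- 0.2 + 0.004 * (toy$year - 1921) + rnorm(100, sd = 0.03)
toy$danceability <- 0.45 + 0.3 * toy$energy + rnorm(100, sd = 0.03)

# Step 1: linear trend of one feature over time
trend <- lm(energy ~ year, data = toy)
slope <- unname(coef(trend)["year"])    # estimated change in energy per year
r_squared <- summary(trend)$r.squared   # share of variance explained by year

# Step 2: Pearson correlation between two features
ct <- cor.test(toy$energy, toy$danceability)
r_value <- unname(ct$estimate)

print(c(slope = slope, r_squared = r_squared, r = r_value))
```

A slope near zero with a low R-squared would suggest no clear trend over the years, while an r-value close to 1 or -1 would indicate a strong linear relationship between the two features.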

3 Analysis

3.1 Genre Condensation

There were 3,232 genres. We split each genre label into words using regular expressions, counted how often each word occurred, and condensed the genres into the 20 most frequent terms.
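As a minimal sketch of this term-counting step, the example below uses a handful of hypothetical genre labels and keeps the top 2 terms instead of the top 20. It uses base R rather than the tidyverse functions in our analysis code, so it should be read as an illustration of the idea only.

```r
# Hypothetical genre labels standing in for the real ones
genres <- c("dance pop", "indie rock", "pop rock",
            "hard rock", "latin pop", "deep house")

# Split each label on whitespace (a regular expression) and pool the words
terms <- unlist(strsplit(genres, "\\s+"))

# Count occurrences and keep the most frequent terms (top 20 in the report)
term_counts <- sort(table(terms), decreasing = TRUE)
top_terms <- names(head(term_counts, 2))

print(term_counts)
print(top_terms)
```

Note that a genre made of several words contributes one count to each of its words, so the counts add up to more than the number of genres.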

3.1.1 All genres

Could include all genres here just to show

3.1.2 Top 100

Could show all genres and counts here (or around 100) just to show

3.1.3 Top 20

3.1.4 Final Condensed Genre Dataset

We use these top 20 terms to create a more concisely labeled dataset. Genres that contain none of the terms receive the label "other".
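The relabeling can be sketched as follows, assuming a hypothetical vector `top_terms` ordered from most to least frequent. Terms are assigned from lowest to highest priority, so when a genre contains several terms the most frequent one wins. Matching here is on literal substrings (`fixed = TRUE`), which is a simplification of a regular-expression match.

```r
# Hypothetical inputs: a few genre labels and the top terms, most frequent first
genres <- c("dance pop", "indie rock", "pop rock", "deep house")
top_terms <- c("pop", "rock")

# Start with "other", then overwrite from lowest to highest priority so that
# the most frequent matching term is written last and therefore wins
simple_genre <- rep("other", length(genres))
for (term in rev(top_terms)) {
  simple_genre[grepl(term, genres, fixed = TRUE)] <- term
}

print(data.frame(genres, simple_genre))
```

In this toy example "pop rock" is labeled "pop" because "pop" ranks above "rock", and "deep house" falls into "other".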

3.2 r-values

3.2.1 All combinations raw

We compute an r-value for every possible pair of features.
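One way to list an r-value for every pair is to compute the full correlation matrix and read off its upper triangle, so that each pair appears once. The sketch below uses synthetic columns named after three of the features; the values are made up and the names `toy` and `pairs_r` are only for illustration.

```r
# Synthetic features (made-up values, not the Spotify data)
set.seed(2)
toy <- data.frame(energy = runif(50))
toy$loudness <- -20 + 15 * toy$energy + rnorm(50, sd = 1)
toy$acousticness <- 1 - toy$energy + rnorm(50, sd = 0.1)

# Full correlation matrix (Pearson's r for every pair of columns)
r_matrix <- round(cor(toy), 2)

# One row per unordered pair, taken from the upper triangle of the matrix
idx <- which(upper.tri(r_matrix), arr.ind = TRUE)
pairs_r <- data.frame(
  feature_a = rownames(r_matrix)[idx[, "row"]],
  feature_b = colnames(r_matrix)[idx[, "col"]],
  r = r_matrix[idx]
)

print(pairs_r)
```

With k features this gives k(k - 1)/2 rows, which makes it easier to sort or filter the pairs by the size of r than reading the square matrix.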

3.2.2 r-values

Just display r-value on graphs:

3.2.2.1 Upper number

3.2.2.2 Upper color

3.2.2.3 Number / Color

3.3 Genre Graphs

3.3.1 Condensed genre counts

3.3.2 Condensed genre counts (only top 20) (colored)

3.3.3 Genres vs instrumentalness (geom_points)

3.3.4 Genres vs instrumentalness (geom_col)

3.3.5 Genres vs instrumentalness (geom_boxplot)

3.4 Feature comparisons

3.4.1 Graphs of Combinations of Features

(All / many) combinations graphs:

3.4.1.1 Select few

3.4.1.2 All

4 Discussion

A shortcoming of our analysis is that we do not know how many songs are included in the data for each year. Some years' data may be based on more songs than others.
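If a track-level table were available, this could be checked by counting rows per year. The sketch below assumes a hypothetical data frame `tracks` with one row per song and a `year` column; the rows shown are placeholders, not real data.

```r
# Hypothetical track-level table: one row per song (placeholder values)
tracks <- data.frame(year = c(1921, 1921, 1950, 2020, 2020, 2020))

# Number of songs behind each year's summary values
songs_per_year <- as.data.frame(table(year = tracks$year),
                                responseName = "n_songs")

print(songs_per_year)
```

Years with very few songs could then be flagged or down-weighted before interpreting trends.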

Future work on this dataset could involve testing more of the relationships between features and checking whether they yield strong models. We could also look for datasets from other music streaming services, such as Apple Music and Pandora.

5 Analysis [DO NOT INCLUDE]

Your data analysis should include substantial data exploration including graphical and numerical summaries which do not appear in the final report. You may exclude such analysis by using include=FALSE in the corresponding R chunks.

5.1 Simplify genre categories

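library(tidyverse)

# Genre-level data: one row per raw genre label
genre <- read_csv("data/data_by_genres.csv", col_types = cols())
# Split each genre label into its whitespace-separated words
x <- genre %>% 
  select(genres) %>% 
  mutate(
    words = str_split(genres, "\\s")
  )
all_terms <- tibble(term = unlist(x$words, recursive = FALSE))
# Top 20 terms (a genre with several words is counted once per word)
genre_term_count <- all_terms %>% 
  group_by(term) %>% 
  summarise(n = n()) %>% 
  arrange(desc(n)) %>% 
  head(20)
# Conditions listed earlier in case_when() take priority,
# so more frequent terms win when a genre matches several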
condensed <- genre %>% 
  mutate(
    simple_genre = case_when(
      str_detect(genres, genre_term_count$term[1]) ~ genre_term_count$term[1],
      str_detect(genres, genre_term_count$term[2]) ~ genre_term_count$term[2],
      str_detect(genres, genre_term_count$term[3]) ~ genre_term_count$term[3],
      str_detect(genres, genre_term_count$term[4]) ~ genre_term_count$term[4],
      str_detect(genres, genre_term_count$term[5]) ~ genre_term_count$term[5],
      str_detect(genres, genre_term_count$term[6]) ~ genre_term_count$term[6],
      str_detect(genres, genre_term_count$term[7]) ~ genre_term_count$term[7],
      str_detect(genres, genre_term_count$term[8]) ~ genre_term_count$term[8],
      str_detect(genres, genre_term_count$term[9]) ~ genre_term_count$term[9],
      str_detect(genres, genre_term_count$term[10]) ~ genre_term_count$term[10],
      str_detect(genres, genre_term_count$term[11]) ~ genre_term_count$term[11],
      str_detect(genres, genre_term_count$term[12]) ~ genre_term_count$term[12],
      str_detect(genres, genre_term_count$term[13]) ~ genre_term_count$term[13],
      str_detect(genres, genre_term_count$term[14]) ~ genre_term_count$term[14],
      str_detect(genres, genre_term_count$term[15]) ~ genre_term_count$term[15],
      str_detect(genres, genre_term_count$term[16]) ~ genre_term_count$term[16],
      str_detect(genres, genre_term_count$term[17]) ~ genre_term_count$term[17],
      str_detect(genres, genre_term_count$term[18]) ~ genre_term_count$term[18],
      str_detect(genres, genre_term_count$term[19]) ~ genre_term_count$term[19],
      str_detect(genres, genre_term_count$term[20]) ~ genre_term_count$term[20],
      TRUE ~ "other"
    )
  )

# condensed %>% 
#   group_by(simple_genre) %>% 
#   summarise(n = n()) %>% 
#   arrange(desc(n))

5.2 All feature vs. feature

# Boxplot of each feature by condensed genre (ggplot must be printed inside a loop)
# for (var_name in colnames(year)[2:12]) {
#   g <- condensed %>%
#     ggplot(aes(y = simple_genre, x = .data[[var_name]])) +
#     geom_boxplot()
#   print(g)
# }

5.3 All r-values

year <- read_csv("data/data_by_year.csv", col_types = cols())

# All r values
allr <- year %>%
  select(-c(year, key, mode)) %>%
  cor() %>%
  round(digits = 2) %>%
  data.frame()

library(corrr)
allr2 <- year %>% 
  select(-c(year, key, mode)) %>% 
  correlate()

allr2
# allr2 %>% 
#   mutate(
#     acousticness = case_when(
#       abs(acousticness) > 0.5 ~ acousticness,
#       TRUE ~ NA,
#     )
#   )

# Keep only strong correlations: blank out every r with |r| below the threshold
threshold <- 0.7
allr3 <- allr2 %>% 
  mutate(across(where(is.numeric), ~ if_else(abs(.x) < threshold, NA_real_, .x)))
allr3

5.4 Displaying all r-values

# All graphs
# for (a in colnames(year)[2:12]) {
#   for (b in colnames(year)[2:12]) {
#     if (a > b) {
#       g <- year %>%
#         summarise(
#           a_var = get(a),
#           b_var = get(b)
#         ) %>% 
#         ggplot(aes(x = a_var, y = b_var)) +
#         geom_point() +
#         labs(
#           x = a,
#           y = b
#           )
#       print(g)
#     }
#   }
# }

# https://statsandr.com/blog/correlation-coefficient-and-correlation-test-in-r/

# only some
pairs(year[,c(5, 2, 7, 8)])

# all
pairs(year[2:12])

# > install.packages("corrplot")
library(corrplot)

year_sh <- year %>% 
  rename(
    yr = year,
    ac = acousticness,
    db = danceability,
    dur = duration_ms,
    en = energy,
    ins = instrumentalness,
    li = liveness,
    lo = loudness,
    sp = speechiness,
    tmp = tempo,
    val = valence,
    pop = popularity,
    k = key,
    m = mode
  )

corrplot.mixed(
  round(cor(year_sh[2:12]), 1),
  lower = "number",
  upper = "color"
)

6 References